House Sales Prediction in King County, USA

Introduction

  • While browsing Kaggle's open datasets for a regression dataset suitable for a lecture, I came across this one. Link

  • This dataset contains King County house prices for homes sold between May 2014 and May 2015.

  • What I want to convey in the lecture with this data is the power of Feature Engineering and Data Exploration. The goal is to show how far a predictive model's performance can be pushed just by repeating a few very simple steps.

  • Since this is material prepared for a lecture, I have not added detailed commentary. I plan to spend more time later and write a post with thorough explanations.

Tying Shoes

### Load the libraries
library(lubridate)
library(readr)
library(dplyr)
library(ggplot2)
library(GGally)
library(corrplot)
library(ggmap)
### Load the dataset
House <- read_csv("data/kc_house_data.csv")
## Parsed with column specification:
## cols(
##   .default = col_integer(),
##   id = col_character(),
##   date = col_datetime(format = ""),
##   price = col_double(),
##   bathrooms = col_double(),
##   floors = col_double(),
##   lat = col_double(),
##   long = col_double()
## )
## See spec(...) for full column specifications.
head(House)
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
7129300520 2014-10-13 221900 3 1.00 1180 5650 1 0 0 3 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
6414100192 2014-12-09 538000 3 2.25 2570 7242 2 0 0 3 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
5631500400 2015-02-25 180000 2 1.00 770 10000 1 0 0 3 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
2487200875 2014-12-09 604000 4 3.00 1960 5000 1 0 0 5 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
1954400510 2015-02-18 510000 3 2.00 1680 8080 1 0 0 3 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503
7237550310 2014-05-12 1225000 4 4.50 5420 101930 1 0 0 3 11 3890 1530 2001 0 98053 47.6561 -122.005 4760 101930

Visualizing (Heat Map)

### Initialize a map for King County
kingCounty <- get_map(location = 'issaquah',
                      zoom = 9,
                      maptype = "roadmap"
)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=issaquah&zoom=9&size=640x640&scale=2&maptype=roadmap&language=en-EN
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=issaquah
### Generate a heat map
ggmap(kingCounty) + 
    geom_density2d(data = House, aes(x = long, y = lat), size = .3) + 
    stat_density2d(data = House, aes(x = long, y = lat, fill = ..level.., alpha = ..level..), size = 0.01, bins = 16, geom = "polygon") + 
    scale_fill_gradient(low = "green", high = "red") + 
    scale_alpha(range = c(0.2, 0.4), guide = FALSE)

  • Most of the data is concentrated around the Seattle area, though outlying areas such as Buckley and Snoqualmie are also included.

Data Preparation

House %>%
    mutate(
        sale_year = year(date),
        sale_month = month(date)
    ) %>%
    select(-id, -date) -> House
set.seed(2017)
trainIdx <- sample(1:nrow(House), size = 0.7 * nrow(House))
train <- House[trainIdx, ]
test <- House[-trainIdx, ]

Benchmark

bench_model <- lm(price ~ ., data = train)
summary(bench_model)
## 
## Call:
## lm(formula = price ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1289272   -98433    -9562    76172  4400147 
## 
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.428e+07  1.179e+07  -6.301 3.03e-10 ***
## bedrooms      -3.369e+04  2.203e+03 -15.296  < 2e-16 ***
## bathrooms      4.165e+04  3.863e+03  10.782  < 2e-16 ***
## sqft_living    1.444e+02  5.198e+00  27.778  < 2e-16 ***
## sqft_lot       1.202e-01  5.446e-02   2.207 0.027360 *  
## floors         5.802e+03  4.248e+03   1.366 0.172069    
## waterfront     5.701e+05  2.039e+04  27.952  < 2e-16 ***
## view           5.602e+04  2.482e+03  22.572  < 2e-16 ***
## condition      2.925e+04  2.802e+03  10.439  < 2e-16 ***
## grade          9.494e+04  2.551e+03  37.217  < 2e-16 ***
## sqft_above     2.731e+01  5.154e+00   5.299 1.18e-07 ***
## sqft_basement         NA         NA      NA       NA    
## yr_built      -2.489e+03  8.579e+01 -29.018  < 2e-16 ***
## yr_renovated   2.151e+01  4.304e+00   4.997 5.89e-07 ***
## zipcode       -5.576e+02  3.909e+01 -14.265  < 2e-16 ***
## lat            6.133e+05  1.268e+04  48.360  < 2e-16 ***
## long          -2.092e+05  1.559e+04 -13.422  < 2e-16 ***
## sqft_living15  2.897e+01  4.061e+00   7.134 1.02e-12 ***
## sqft_lot15    -2.804e-01  8.390e-02  -3.342 0.000833 ***
## sale_year      3.893e+04  5.581e+03   6.976 3.16e-12 ***
## sale_month     1.914e+03  8.338e+02   2.296 0.021713 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 198800 on 15109 degrees of freedom
## Multiple R-squared:  0.7017, Adjusted R-squared:  0.7013 
## F-statistic:  1870 on 19 and 15109 DF,  p-value: < 2.2e-16
benchmark <- predict(bench_model, test)
benchmark <- ifelse(benchmark < 0, 0, benchmark)
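
The clipping on the last line matters for the RMSLE metric used later: `log(x + 1)` is undefined (`NaN` in R) for `x < -1`, so a single negative price prediction would break the score. A minimal illustration:

```r
# log(x + 1) is NaN for x < -1, so negative predictions must be floored
p <- c(-50000, 0, 250000)
log(p + 1)         # first element: NaN (with a warning)
pmax(p, 0)         # vectorized equivalent of the ifelse() clipping above
```

`pmax()` is a drop-in replacement for the `ifelse()` call here.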
  • The benchmark model will obviously perform quite poorly. Let's improve on it through the steps below.

Data Exploration

### Generate a heat map
ggmap(kingCounty) + 
    geom_point(data = train, aes(x = long, y = lat, color = log(price), alpha = log(price))) + 
    scale_color_gradient(low = "green", high = "red")

  • Visualizing the log-transformed price shows that the north (lat >= 47.5) is priced higher than the south (lat < 47.5).
  • Within that area, locations adjacent to the waterfront are priced higher still.

Correlation

### Note: after Data Preparation, price is column 1, so [, -1] drops price
cor_House <- cor(House[, -1])
corrplot(cor_House, order = "hclust")

Boxplots

Grade - Price
train %>%
    mutate(grade = factor(grade)) %>%
    ggplot(aes(x = grade, y = price, fill = grade)) +
    geom_boxplot() + 
    geom_point(
        data = train %>% 
            group_by(grade) %>%
            summarise(median = median(price)) %>%
            mutate(grade = factor(grade)),
        aes(x = grade, y = median, group = 1),
        size = 5, stroke = 2,
        color = "black", fill = "white", shape = 23
    )

  • Price appears to grow exponentially with each one-step increase in grade. To check, draw the boxplots against log(price).
train %>%
    mutate(grade = factor(grade)) %>%
    ggplot(aes(x = grade, y = log(price), fill = grade)) +
    geom_boxplot() + 
    geom_point(
        data = train %>% 
            group_by(grade) %>%
            summarise(median = median(log(price))) %>%
            mutate(grade = factor(grade)),
        aes(x = grade, y = median, group = 1),
        size = 5, stroke = 2,
        color = "black", fill = "white", shape = 23
    )

Year Built - Price
train %>%
    mutate(yr_cat = cut(yr_built, breaks = seq(1900, 2020, by = 10),
                        labels = paste0(seq(1900, 2010, by = 10), "s"))) %>%
    ggplot(aes(x = yr_cat, y = log(price), fill = yr_cat)) + 
    geom_boxplot()

  • The decade a house was built offers no obvious insight into its price.
Year Renovated - Price
train %>%
    filter(yr_renovated != 0) %>%
    mutate(renovated_cat = cut(yr_renovated, breaks = seq(1930, 2020, by = 10),
                        labels = paste0(seq(1930, 2010, by = 10), "s"))) %>%
    ggplot(aes(x = renovated_cat, y = log(price), fill = renovated_cat)) + 
    geom_boxplot()

  • Among renovated houses, more recent renovations tend to command at least slightly higher prices.
Is there any difference between renovated / non-renovated?
train %>%
    mutate(isRenovated = factor(ifelse(yr_renovated != 0, 1, 0))) %>%
    ggplot(aes(x = isRenovated, y = log(price), fill = isRenovated)) + 
    geom_boxplot()

  • Renovated houses are generally priced higher.
Year Sold - Price / Month Sold - Price
train %>%
    mutate(sale_year = factor(sale_year)) %>%
    ggplot(aes(x = sale_year, y = log(price), fill = sale_year)) + 
    geom_boxplot()

train %>%
    mutate(sale_month = factor(sale_month)) %>%
    ggplot(aes(x = sale_month, y = log(price), fill = sale_month)) + 
    geom_boxplot()

  • Neither variable appears to affect price.
Bathrooms - Price
train %>%
    mutate(bathrooms = factor(bathrooms)) %>%
    ggplot(aes(x = bathrooms, y = log(price), fill = bathrooms)) + 
    geom_boxplot()

  • log(price) and bathrooms have an approximately linear relationship.
Coordinate - Price
train %>%
    ggplot(aes(x = lat, y = log(price), color = lat)) + 
    geom_line() + geom_point(shape = 21)

train %>%
    ggplot(aes(x = long, y = log(price), color = long)) + 
    geom_line() + geom_point(shape = 21)

  • Both latitude and longitude show high-price bands within specific ranges. Creating new variables that separate those regions should help.
    • Latitude : ~47.5 / 47.5 ~ 47.6 / 47.6 ~
Zip Code - Price
sort(unique(train$zipcode)) == sort(unique(test$zipcode))
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [57] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
  • The training and test data contain the same set of zip codes.
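
The element-wise `==` comparison above only works because the two vectors happen to have the same length; a sketch of a length-agnostic alternative using base R's `setequal()`:

```r
# setequal() ignores order, duplicates, and length differences,
# unlike element-wise ==, which recycles or errors on unequal lengths
a <- c(98001, 98002, 98003, 98003)
b <- c(98003, 98001, 98002)
setequal(a, b)
```

With the real data this would read `setequal(unique(train$zipcode), unique(test$zipcode))`.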
train %>%
    arrange(zipcode) %>%
    mutate(zipcode = factor(zipcode)) %>%
    ggplot(aes(x = zipcode, y = log(price), fill = zipcode)) + 
    geom_boxplot() + 
    theme(axis.text.x = element_text(angle = 90, vjust = .1))

  • Consider expanding the data by one-hot encoding the zip code.
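
The one-hot expansion can be done with `model.matrix()`, as the Feature Engineering section below does; a toy sketch of the `0 +` formula, which drops the intercept so every level gets its own indicator column:

```r
# factor() turns the codes into levels; `0 +` removes the intercept column,
# so each row gets exactly one 1, in the column for its zip code
zc <- factor(c(98001, 98002, 98001))
model.matrix(~ 0 + zc)
```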

Feature Engineering

Split the latitude
splitLat <- function(data){
    data <- data %>%
        dplyr::mutate(lat1 = ifelse(lat <= 47.5, lat, 0),
                      lat2 = ifelse(lat > 47.5 & lat <= 47.6, lat, 0),
                      lat3 = ifelse(lat > 47.6, lat, 0)) %>%
        dplyr::select(-lat)
    return(data)
}

train <- splitLat(train)
test <- splitLat(test)
Is this house renovated?
train <- train %>%
    mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))

test <- test %>%
    mutate(isRenovated = ifelse(yr_renovated != 0, 1, 0))
How old is this house?
train <- train %>%
    mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))

test <- test %>%
    mutate(age = ifelse(yr_renovated != 0, 2016 - yr_renovated, 2016 - yr_built))
Zip Code (one-hot encoding)
train$zipcode <- factor(train$zipcode)
test$zipcode <- factor(test$zipcode)
zipcode_train <- data.frame(model.matrix(price ~ 0 + zipcode, data = train))
zipcode_test <- data.frame(model.matrix(price ~ 0 + zipcode, data = test))
train <- train %>%
    select(-zipcode) %>%
    cbind(zipcode_train)

test <- test %>%
    select(-zipcode) %>%
    cbind(zipcode_test)
Feature Selection

Modeling

model <- lm(log(price) ~ ., data = train)
summary(model)
## 
## Call:
## lm(formula = log(price) ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34466 -0.09549  0.00900  0.10445  0.99991 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.718e+02  1.348e+01 -12.746  < 2e-16 ***
## bedrooms       3.475e-03  2.059e-03   1.687 0.091539 .  
## bathrooms      3.492e-02  3.607e-03   9.682  < 2e-16 ***
## sqft_living    1.293e-04  4.842e-06  26.706  < 2e-16 ***
## sqft_lot       5.831e-07  5.052e-08  11.542  < 2e-16 ***
## floors        -2.629e-02  4.304e-03  -6.107 1.04e-09 ***
## waterfront     4.803e-01  1.920e-02  25.022  < 2e-16 ***
## view           5.911e-02  2.351e-03  25.144  < 2e-16 ***
## condition      6.463e-02  2.662e-03  24.278  < 2e-16 ***
## grade          8.787e-02  2.489e-03  35.305  < 2e-16 ***
## sqft_above     7.052e-05  4.960e-06  14.218  < 2e-16 ***
## sqft_basement         NA         NA      NA       NA    
## yr_built       1.412e-05  3.319e-04   0.043 0.966069    
## yr_renovated   3.900e-03  5.273e-04   7.396 1.48e-13 ***
## long          -2.737e-01  6.348e-02  -4.312 1.63e-05 ***
## sqft_living15  8.920e-05  3.933e-06  22.679  < 2e-16 ***
## sqft_lot15     1.997e-07  7.990e-08   2.500 0.012442 *  
## sale_year      6.783e-02  5.156e-03  13.156  < 2e-16 ***
## sale_month     3.004e-03  7.699e-04   3.902 9.58e-05 ***
## lat1           2.801e-01  9.154e-02   3.060 0.002215 ** 
## lat2           2.825e-01  9.148e-02   3.088 0.002016 ** 
## lat3           2.829e-01  9.142e-02   3.094 0.001978 ** 
## isRenovated   -7.677e+00  1.046e+00  -7.339 2.26e-13 ***
## age            2.176e-04  3.375e-04   0.645 0.518993    
## zipcode98001  -6.004e-01  3.569e-02 -16.821  < 2e-16 ***
## zipcode98002  -6.168e-01  3.805e-02 -16.209  < 2e-16 ***
## zipcode98003  -5.980e-01  3.546e-02 -16.866  < 2e-16 ***
## zipcode98004   2.897e-01  2.206e-02  13.137  < 2e-16 ***
## zipcode98005  -4.895e-02  2.638e-02  -1.855 0.063556 .  
## zipcode98006  -1.367e-01  2.759e-02  -4.955 7.29e-07 ***
## zipcode98007  -1.257e-01  2.798e-02  -4.491 7.13e-06 ***
## zipcode98008  -1.270e-01  2.610e-02  -4.867 1.14e-06 ***
## zipcode98010  -2.843e-01  4.438e-02  -6.407 1.53e-10 ***
## zipcode98011  -3.848e-01  2.631e-02 -14.625  < 2e-16 ***
## zipcode98014  -3.790e-01  4.221e-02  -8.978  < 2e-16 ***
## zipcode98019  -4.396e-01  3.590e-02 -12.246  < 2e-16 ***
## zipcode98022  -4.537e-01  4.833e-02  -9.388  < 2e-16 ***
## zipcode98023  -6.622e-01  3.506e-02 -18.890  < 2e-16 ***
## zipcode98024  -2.732e-01  4.449e-02  -6.141 8.39e-10 ***
## zipcode98027  -1.881e-01  3.233e-02  -5.816 6.15e-09 ***
## zipcode98028  -4.212e-01  2.326e-02 -18.107  < 2e-16 ***
## zipcode98029  -1.141e-01  3.418e-02  -3.339 0.000842 ***
## zipcode98030  -5.423e-01  3.469e-02 -15.632  < 2e-16 ***
## zipcode98031  -5.332e-01  3.254e-02 -16.388  < 2e-16 ***
## zipcode98032  -6.520e-01  3.605e-02 -18.085  < 2e-16 ***
## zipcode98033  -3.718e-02  2.174e-02  -1.710 0.087260 .  
## zipcode98034  -2.914e-01  2.131e-02 -13.677  < 2e-16 ***
## zipcode98038  -3.837e-01  3.739e-02 -10.261  < 2e-16 ***
## zipcode98039   4.156e-01  3.461e-02  12.007  < 2e-16 ***
## zipcode98040   7.663e-02  2.651e-02   2.890 0.003853 ** 
## zipcode98042  -5.178e-01  3.494e-02 -14.820  < 2e-16 ***
## zipcode98045  -1.802e-01  4.815e-02  -3.743 0.000183 ***
## zipcode98052  -1.569e-01  2.412e-02  -6.505 8.04e-11 ***
## zipcode98053  -1.789e-01  2.956e-02  -6.051 1.47e-09 ***
## zipcode98055  -4.772e-01  3.013e-02 -15.838  < 2e-16 ***
## zipcode98056  -4.080e-01  2.753e-02 -14.821  < 2e-16 ***
## zipcode98058  -4.548e-01  3.093e-02 -14.703  < 2e-16 ***
## zipcode98059  -3.240e-01  2.930e-02 -11.056  < 2e-16 ***
## zipcode98065  -2.798e-01  4.155e-02  -6.734 1.71e-11 ***
## zipcode98070  -3.816e-01  3.460e-02 -11.029  < 2e-16 ***
## zipcode98072  -3.324e-01  2.741e-02 -12.128  < 2e-16 ***
## zipcode98074  -2.089e-01  2.808e-02  -7.440 1.06e-13 ***
## zipcode98075  -1.880e-01  3.288e-02  -5.718 1.10e-08 ***
## zipcode98077  -3.618e-01  3.144e-02 -11.505  < 2e-16 ***
## zipcode98092  -5.397e-01  3.801e-02 -14.198  < 2e-16 ***
## zipcode98102   1.036e-01  2.557e-02   4.053 5.09e-05 ***
## zipcode98103  -4.685e-02  1.643e-02  -2.851 0.004358 ** 
## zipcode98105   8.994e-02  2.070e-02   4.345 1.40e-05 ***
## zipcode98106  -4.843e-01  2.412e-02 -20.077  < 2e-16 ***
## zipcode98107  -3.243e-02  1.897e-02  -1.709 0.087475 .  
## zipcode98108  -4.541e-01  2.633e-02 -17.245  < 2e-16 ***
## zipcode98109   1.567e-01  2.431e-02   6.446 1.18e-10 ***
## zipcode98112   2.177e-01  1.997e-02  10.899  < 2e-16 ***
## zipcode98115  -3.640e-02  1.718e-02  -2.119 0.034114 *  
## zipcode98116  -6.915e-02  2.326e-02  -2.972 0.002959 ** 
## zipcode98117  -6.207e-02  1.637e-02  -3.792 0.000150 ***
## zipcode98118  -3.440e-01  2.411e-02 -14.270  < 2e-16 ***
## zipcode98119   1.248e-01  2.083e-02   5.992 2.12e-09 ***
## zipcode98122  -3.056e-02  1.935e-02  -1.579 0.114352    
## zipcode98125  -2.821e-01  1.917e-02 -14.713  < 2e-16 ***
## zipcode98126  -2.725e-01  2.384e-02 -11.433  < 2e-16 ***
## zipcode98133  -4.175e-01  1.852e-02 -22.543  < 2e-16 ***
## zipcode98136  -1.397e-01  2.514e-02  -5.556 2.80e-08 ***
## zipcode98144  -1.549e-01  2.358e-02  -6.571 5.16e-11 ***
## zipcode98146  -4.872e-01  2.546e-02 -19.133  < 2e-16 ***
## zipcode98148  -4.666e-01  3.982e-02 -11.716  < 2e-16 ***
## zipcode98155  -4.416e-01  2.033e-02 -21.714  < 2e-16 ***
## zipcode98166  -3.704e-01  2.870e-02 -12.903  < 2e-16 ***
## zipcode98168  -6.160e-01  2.705e-02 -22.773  < 2e-16 ***
## zipcode98177  -2.814e-01  2.055e-02 -13.696  < 2e-16 ***
## zipcode98178  -5.611e-01  2.746e-02 -20.433  < 2e-16 ***
## zipcode98188  -5.367e-01  3.216e-02 -16.688  < 2e-16 ***
## zipcode98198  -5.939e-01  3.122e-02 -19.024  < 2e-16 ***
## zipcode98199          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1832 on 15037 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8791 
## F-statistic:  1210 on 91 and 15037 DF,  p-value: < 2.2e-16
Evaluation Metrics: RMSLE & MSE

Root Mean Squared Logarithmic Error (RMSLE) and Mean Squared Error (MSE) are used as the evaluation metrics. RMSLE penalizes under-estimates more heavily than over-estimates.

\[ RMSLE = \sqrt{\frac{1}{n} \sum^n_{i=1} \left( \log(p_i + 1) - \log(a_i + 1)\right)^2} \]

rmsle <- function(predict, actual){
    if(length(predict) != length(actual))
        stop("The lengths of the two vectors differ.")
    
    len <- length(predict)
    rmsle <- sqrt((1/len) * sum((log(predict + 1) - log(actual + 1))^2))
    return(rmsle)
}
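
A quick check of the asymmetry claimed above, using the function just defined (repeated here in compact form so the chunk runs on its own): under-predicting a $200,000 house by $50,000 scores worse than over-predicting it by the same amount.

```r
# Compact restatement of rmsle() from above, so this chunk is self-contained
rmsle <- function(predict, actual)
    sqrt(mean((log(predict + 1) - log(actual + 1))^2))

under <- rmsle(150000, 200000)   # short by 50k
over  <- rmsle(250000, 200000)   # over by 50k
under > over                     # TRUE: the under-estimate is penalized more
```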

The second metric is labeled MSE, but note that both the formula and the implementation below take a square root, so it is actually the root mean squared error (RMSE):

\[ RMSE = \sqrt{\frac{1}{n} \sum^n_{i=1} (p_i - a_i)^2} \]

mse <- function(predict, actual){
    if(length(predict) != length(actual))
        stop("The lengths of the two vectors differ.")
    
    len <- length(predict)
    # the sqrt() makes this RMSE; the name mse is kept for consistency with the text
    mse <- sqrt((1/len) * sum((predict - actual)^2))
    return(mse)
}
Test
pred <- predict(model, test)
pred <- exp(pred)

result.rmsle <- rmsle(pred, test$price)
benchmark.rmsle <- rmsle(benchmark, test$price)
cat("RMSLE (Benchmark): ", benchmark.rmsle, "\nRMSLE (Final): ", result.rmsle)
## RMSLE (Benchmark):  0.9280684 
## RMSLE (Final):  0.1794722
result.mse <- mse(pred, test$price)
benchmark.mse <- mse(benchmark, test$price)
cat("MSE (Benchmark):", benchmark.mse, "\nMSE (Final):", result.mse)
## MSE (Benchmark): 204989.7 
## MSE (Final): 193444.6

Regularization Methods

Ridge Regression
library(glmnet)

set.seed(2017)
lambda <- exp(-seq(7, 8, length.out = 400))

# parallel = TRUE requires a registered backend first, e.g. doMC::registerDoMC(cores = 4)
ridge.cv <- cv.glmnet(
    x = as.matrix(train[, -1]),
    y = log(train$price),
    alpha = 0,
    lambda = lambda,
    parallel = TRUE
)
ridge.pred <- predict(ridge.cv, as.matrix(test[, -1]), s = ridge.cv$lambda.min)
ridge.pred <- as.vector(exp(ridge.pred))
ridge.rmsle <- rmsle(ridge.pred, test$price)
cat("RMSLE (Ridge):", ridge.rmsle)
## RMSLE (Ridge): 0.1797213
ridge.mse <- mse(ridge.pred, test$price)
cat("MSE (Ridge):", ridge.mse)
## MSE (Ridge): 192229.2
Lasso
set.seed(2017)
lambda <- exp(-seq(10, 11, length.out = 400))

lasso.cv <- cv.glmnet(
    x = as.matrix(train[, -1]),
    y = log(train$price),
    alpha = 1,
    lambda = lambda,
    parallel = TRUE
)
lasso.pred <- predict(lasso.cv, as.matrix(test[, -1]), s = lasso.cv$lambda.min)
lasso.pred <- as.vector(exp(lasso.pred))
lasso.rmsle <- rmsle(lasso.pred, test$price)
cat("RMSLE (Lasso):", lasso.rmsle)
## RMSLE (Lasso): 0.1797293
lasso.mse <- mse(lasso.pred, test$price)
cat("MSE (Lasso):", lasso.mse)
## MSE (Lasso): 192438.7
Weighted Lasso
# Adaptive Lasso Function with Automatic 10-fold CV
adaLasso <- function(data, labels, parallel = TRUE, standardize = TRUE, weight, gamma = 1, formula = NULL, ols.data = NULL, ridge.lambda = NULL, lasso.lambda = NULL, seed = 1){
    require(glmnet)
    if(!(weight %in% c("ols", "ridge"))){
        stop("The parameter 'weight' must be either 'ols' or 'ridge'.")
    }
    if(weight == "ols"){
        if(is.null(ols.data)){
            stop("If you want to use the coefficients of OLS as the weight for Adaptive Lasso, you have to put a data.frame to ols.data argument.")
        }
        ols <- lm(formula = formula, data = ols.data)
        weight <- 1/abs(as.matrix(coefficients(ols)[-1]))^gamma
    } else if(weight == "ridge"){
        set.seed(seed)
        if(parallel)
            doMC::registerDoMC(cores = 4)
        
        cv.ridge <- cv.glmnet(x = data, y = labels, alpha = 0, parallel = parallel, standardize = standardize, lambda = ridge.lambda)
        weight <- 1/abs(matrix(coef(cv.ridge, s = cv.ridge$lambda.min)[-1, ]))^gamma
    }
    # zero coefficients yield infinite weights; replace with a very large finite penalty
    weight[, 1][weight[, 1] == Inf] <- 999999999
    set.seed(seed)
    if(parallel)
        doMC::registerDoMC(cores = 4)
    cv.lasso <- cv.glmnet(x = data, y = labels, alpha = 1, parallel = parallel, standardize = standardize, lambda = lasso.lambda, penalty.factor = weight)
    cv.lasso
}
set.seed(2017)
ridge.lambda <- exp(-seq(7, 8, length.out = 400))
lasso.lambda <- exp(-seq(9, 10, length.out = 400))

adaLasso.cv <- adaLasso(
    data = as.matrix(train[, -1]),
    labels = log(train$price),
    weight = "ridge",
    ridge.lambda = ridge.lambda,
    lasso.lambda = lasso.lambda,
    parallel = TRUE
)
adaLasso.pred <- predict(adaLasso.cv, as.matrix(test[, -1]), s = adaLasso.cv$lambda.min)
adaLasso.pred <- as.vector(exp(adaLasso.pred))
adaLasso.rmsle <- rmsle(adaLasso.pred, test$price)
cat("RMSLE (adaLasso):", adaLasso.rmsle)
## RMSLE (adaLasso): 0.1796296
adaLasso.mse <- mse(adaLasso.pred, test$price)
cat("MSE (adaLasso):", adaLasso.mse)
## MSE (adaLasso): 190009.3
Comparison
library(reshape2)

reg.result <- data.frame(Method = c("Least Square", "Ridge\nRegression", "Lasso", "Adaptive Lasso"),
                         RMSLE = c(result.rmsle, ridge.rmsle, lasso.rmsle, adaLasso.rmsle),
                         MSE = c(result.mse, ridge.mse, lasso.mse, adaLasso.mse))

reg.result <- melt(reg.result, id.vars = "Method",
                   variable.name = "Metric",
                   value.name = "Score")

reg.result$Method <- factor(reg.result$Method,
                              levels = c("Least Square", "Lasso", "Ridge\nRegression", "Adaptive Lasso"))

ggplot(reg.result, aes(x = Method, y = Score, group = Metric)) + 
    geom_line() + geom_point(aes(color = Method, shape = Method), size = 5) + 
    facet_grid(Metric ~ ., scales = "free_y") + 
    geom_text(aes(label = ifelse(Score < 1, round(Score, 7), round(Score, 1))),
              size = 3, hjust = 1.3, vjust = 1.3, fontface = 'bold')

Tree-Based Methods

Random Forest
library(ranger)

set.seed(2017)
rf <- ranger(
    formula = log(price) ~ .,
    data = train, 
    num.trees = 2000,
    importance = 'impurity',
    write.forest = TRUE
)
## Growing trees.. Progress: 19%. Estimated remaining time: 2 minutes, 8 seconds.
## Growing trees.. Progress: 38%. Estimated remaining time: 1 minute, 39 seconds.
## Growing trees.. Progress: 57%. Estimated remaining time: 1 minute, 9 seconds.
## Growing trees.. Progress: 77%. Estimated remaining time: 37 seconds.
rf.pred <- predict(rf, test)
rf.pred <- exp(rf.pred$predictions)
rf.rmsle <- rmsle(rf.pred, test$price)
cat("RMSLE (Random Forest):", rf.rmsle)
## RMSLE (Random Forest): 0.1775066
rf.mse <- mse(rf.pred, test$price)
cat("MSE (Random Forest):", rf.mse)
## MSE (Random Forest): 155200.4
XGBoost
library(xgboost)

params <- list(
    eta = 0.3,
    gamma = 0,
    max_depth = 5,
    min_child_weight = 1,
    subsample = 1,
    colsample_bytree = 1,
    objective = "reg:linear",
    eval_metric = "rmse"
)

set.seed(2017)

xgb.cv <- xgb.cv(
    params = params,
    data = as.matrix(train[, -1]),
    nrounds = 200,
    nfold = 10,
    label = log(train$price),
    verbose = 1,
    print_every_n = 25
)
## [1]  train-rmse:8.797288+0.000701    test-rmse:8.797174+0.008145 
## [26] train-rmse:0.163139+0.001116    test-rmse:0.180500+0.006925 
## [51] train-rmse:0.142965+0.001086    test-rmse:0.171362+0.006885 
## [76] train-rmse:0.131548+0.000993    test-rmse:0.168632+0.006659 
## [101]    train-rmse:0.122853+0.001172    test-rmse:0.167706+0.006633 
## [126]    train-rmse:0.115637+0.001038    test-rmse:0.166971+0.006473 
## [151]    train-rmse:0.109521+0.000955    test-rmse:0.166733+0.006552 
## [176]    train-rmse:0.104160+0.001013    test-rmse:0.166646+0.006674 
## [200]    train-rmse:0.099301+0.000813    test-rmse:0.166665+0.006460
best.xgb <- xgb.cv$evaluation_log %>%
    arrange(test_rmse_mean, test_rmse_std) %>%
    head(1)
best.xgb
iter train_rmse_mean train_rmse_std test_rmse_mean test_rmse_std
161 0.1073536 0.0011955 0.1665503 0.0066255
iter <- best.xgb$iter

# Note: params is not passed here, so xgboost() trains with its default
# settings rather than the tuned ones used in xgb.cv above.
xgb <- xgboost(
    data = as.matrix(train[, -1]),
    nrounds = iter,
    label = log(train$price),
    verbose = 1,
    print_every_n = 25
)
## [1]  train-rmse:8.797178 
## [26] train-rmse:0.151631 
## [51] train-rmse:0.130160 
## [76] train-rmse:0.116726 
## [101]    train-rmse:0.107669 
## [126]    train-rmse:0.098944 
## [151]    train-rmse:0.092342 
## [161]    train-rmse:0.090079
xgb.pred <- predict(xgb, as.matrix(test[, -1]))
xgb.pred <- exp(xgb.pred)
xgb.rmsle <- rmsle(xgb.pred, test$price)
cat("RMSLE (Xgboost):", xgb.rmsle)
## RMSLE (Xgboost): 0.1654061
xgb.mse <- mse(xgb.pred, test$price)
cat("MSE (Xgboost):", xgb.mse)
## MSE (Xgboost): 124287.6

Final Comparison

library(caret)
result.r2 <- postResample(pred, test$price)[2]
ridge.r2 <- postResample(ridge.pred, test$price)[2]
lasso.r2 <- postResample(lasso.pred, test$price)[2]
adaLasso.r2 <- postResample(adaLasso.pred, test$price)[2]
rf.r2 <- postResample(rf.pred, test$price)[2]
xgb.r2 <- postResample(xgb.pred, test$price)[2]
final.result <- data.frame(Method = c("Least Square", "Ridge\nRegression", "Lasso", "Adaptive\nLasso", "Random\nForest", "XGBoost"),
                         MSE = c(result.mse, ridge.mse, lasso.mse, adaLasso.mse, rf.mse, xgb.mse),
                         RMSLE = c(result.rmsle, ridge.rmsle, lasso.rmsle, adaLasso.rmsle, rf.rmsle, xgb.rmsle),
                         R2 = c(result.r2, ridge.r2, lasso.r2, adaLasso.r2, rf.r2, xgb.r2))

final.result <- melt(final.result, id.vars = "Method",
                   variable.name = "Metric",
                   value.name = "Score")

final.result$Method <- factor(final.result$Method,
                              levels = c("Least Square", "Lasso", "Ridge\nRegression", "Adaptive\nLasso", "Random\nForest", "XGBoost"))

ggplot(final.result, aes(x = Method, y = Score, group = Metric)) + 
    geom_line(linetype = "dashed") + geom_point(aes(color = Method, shape = Method), size = 5) + 
    facet_grid(Metric ~ ., scales = "free_y") + 
    geom_text(aes(label = ifelse(Score < 1, round(Score, 7), round(Score, 1))),
              size = 2.5, hjust = 1.2, vjust = 1.4, fontface = 'bold.italic') + 
    theme(legend.key.size = unit(2, 'lines'))

In addition,

  • In fact, the engineered data contained multicollinearity. Removing latitude and longitude eliminates it, and doing so makes no difference in the performance chart above.
library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif <- vif((lm(log(price) ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
    floors + waterfront + view + condition + grade + sqft_above + 
    long + sqft_living15 + sqft_lot15 + lat1 + lat2 + lat3 +
    isRenovated + age + zipcode98001 + zipcode98002 + 
    zipcode98003 + zipcode98004 + zipcode98005 + zipcode98006 + 
    zipcode98007 + zipcode98008 + zipcode98010 + zipcode98011 + 
    zipcode98014 + zipcode98019 + zipcode98022 + zipcode98023 + 
    zipcode98024 + zipcode98027 + zipcode98028 + zipcode98029 + 
    zipcode98030 + zipcode98031 + zipcode98032 + zipcode98033 + 
    zipcode98034 + zipcode98038 + zipcode98039 + zipcode98040 + 
    zipcode98042 + zipcode98045 + zipcode98052 + zipcode98053 + 
    zipcode98055 + zipcode98056 + zipcode98058 + zipcode98059 + 
    zipcode98065 + zipcode98070 + zipcode98072 + zipcode98074 + 
    zipcode98075 + zipcode98077 + zipcode98092 + zipcode98102 + 
    zipcode98103 + zipcode98105 + zipcode98106 + zipcode98107 + 
    zipcode98108 + zipcode98109 + zipcode98112 + zipcode98115 + 
    zipcode98116 + zipcode98117 + zipcode98118 + zipcode98119 + 
    zipcode98122 + zipcode98125 + zipcode98126 + zipcode98133 + 
    zipcode98136 + zipcode98144 + zipcode98146 + zipcode98148 + 
    zipcode98155 + zipcode98166 + zipcode98168 + zipcode98177 + 
    zipcode98178 + zipcode98188 + zipcode98198, data = train)))
names(vif[vif > 10])
##  [1] "long"         "lat1"         "lat2"         "lat3"        
##  [5] "zipcode98022" "zipcode98023" "zipcode98038" "zipcode98042"
##  [9] "zipcode98045" "zipcode98065" "zipcode98092"
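
The note above can be checked directly; a sketch (not run here) assuming the `train` data frame built in Feature Engineering. The two aliased columns `sqft_basement` and `zipcode98199` are dropped as well, since `vif()` errors on aliased coefficients:

```r
library(car)
library(dplyr)

# Drop the coordinate features plus the two aliased columns, then re-check VIF
no_coord <- train %>%
    select(-long, -lat1, -lat2, -lat3, -sqft_basement, -zipcode98199)
vif_nc <- vif(lm(log(price) ~ ., data = no_coord))
names(vif_nc[vif_nc > 10])   # per the note above, expected to be empty
```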